This notebook builds a music recommendation system on top of Spotify's API: given a user-specified list of songs, it returns a curated list of similar recommendations. Drawing on Spotify's extensive song data, the code processes audio features such as valence and acousticness, fills in missing values, and constructs a numerical vector for each input song. It then computes the cosine distance between the mean vector of the input songs and every song in the Spotify dataset to identify the closest matches, filtering out songs already in the input list. The system handles missing data gracefully and warns the user when a song cannot be found in the Spotify database, making it a solid foundation for a personalized music recommendation platform.¶
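At its core, the recommendation step reduces to ranking a feature matrix by cosine distance from a mean feature vector. A minimal, self-contained sketch of that idea, using made-up feature values rather than the real dataset:

```python
import numpy as np
from scipy.spatial.distance import cdist

# Toy catalog: each row is a song, each column an audio feature
# (e.g. valence, acousticness, danceability). Values are illustrative.
catalog = np.array([
    [0.9, 0.1, 0.8],
    [0.2, 0.9, 0.3],
    [0.8, 0.2, 0.7],
])

# Mean vector of the user's input songs (also made up).
song_center = np.array([[0.85, 0.15, 0.75]])

# Cosine distance between the mean vector and every catalog song;
# smaller distance means more similar.
distances = cdist(song_center, catalog, 'cosine')

# Indices of catalog songs, most similar first.
ranked = np.argsort(distances)[0]
print(ranked)
```

The full notebook applies the same `cdist(..., 'cosine')` + `argsort` pattern to the scaled Spotify dataset.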

Importing Required Libraries¶

In [1]:
import os
import numpy as np
import pandas as pd

import seaborn as sns
import plotly.express as px 
import matplotlib.pyplot as plt
%matplotlib inline

from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.manifold import TSNE
from sklearn.decomposition import PCA
from sklearn.metrics import euclidean_distances
from scipy.spatial.distance import cdist

import warnings
warnings.filterwarnings("ignore")

Reading Datasets¶

In [2]:
data = pd.read_csv("data.csv")
genre_data = pd.read_csv('data_by_genres.csv')
year_data = pd.read_csv('data_by_year.csv')

Inspecting the Datasets¶

In [3]:
print(data.info())
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 170653 entries, 0 to 170652
Data columns (total 19 columns):
 #   Column            Non-Null Count   Dtype  
---  ------            --------------   -----  
 0   valence           170653 non-null  float64
 1   year              170653 non-null  int64  
 2   acousticness      170653 non-null  float64
 3   artists           170653 non-null  object 
 4   danceability      170653 non-null  float64
 5   duration_ms       170653 non-null  int64  
 6   energy            170653 non-null  float64
 7   explicit          170653 non-null  int64  
 8   id                170653 non-null  object 
 9   instrumentalness  170653 non-null  float64
 10  key               170653 non-null  int64  
 11  liveness          170653 non-null  float64
 12  loudness          170653 non-null  float64
 13  mode              170653 non-null  int64  
 14  name              170653 non-null  object 
 15  popularity        170653 non-null  int64  
 16  release_date      170653 non-null  object 
 17  speechiness       170653 non-null  float64
 18  tempo             170653 non-null  float64
dtypes: float64(9), int64(6), object(4)
memory usage: 24.7+ MB
None
In [4]:
data.head()
Out[4]:
valence year acousticness artists danceability duration_ms energy explicit id instrumentalness key liveness loudness mode name popularity release_date speechiness tempo
0 0.0594 1921 0.982 ['Sergei Rachmaninoff', 'James Levine', 'Berli... 0.279 831667 0.211 0 4BJqT0PrAfrxzMOxytFOIz 0.878000 10 0.665 -20.096 1 Piano Concerto No. 3 in D Minor, Op. 30: III. ... 4 1921 0.0366 80.954
1 0.9630 1921 0.732 ['Dennis Day'] 0.819 180533 0.341 0 7xPhfUan2yNtyFG0cUWkt8 0.000000 7 0.160 -12.441 1 Clancy Lowered the Boom 5 1921 0.4150 60.936
2 0.0394 1921 0.961 ['KHP Kridhamardawa Karaton Ngayogyakarta Hadi... 0.328 500062 0.166 0 1o6I8BglA6ylDMrIELygv1 0.913000 3 0.101 -14.850 1 Gati Bali 5 1921 0.0339 110.339
3 0.1650 1921 0.967 ['Frank Parker'] 0.275 210000 0.309 0 3ftBPsC5vPBKxYSee08FDH 0.000028 5 0.381 -9.316 1 Danny Boy 3 1921 0.0354 100.109
4 0.2530 1921 0.957 ['Phil Regan'] 0.418 166693 0.193 0 4d6HGyGT8e121BsdKmw9v6 0.000002 3 0.229 -10.096 1 When Irish Eyes Are Smiling 2 1921 0.0380 101.665
In [5]:
print(genre_data.info())
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2973 entries, 0 to 2972
Data columns (total 14 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   mode              2973 non-null   int64  
 1   genres            2973 non-null   object 
 2   acousticness      2973 non-null   float64
 3   danceability      2973 non-null   float64
 4   duration_ms       2973 non-null   float64
 5   energy            2973 non-null   float64
 6   instrumentalness  2973 non-null   float64
 7   liveness          2973 non-null   float64
 8   loudness          2973 non-null   float64
 9   speechiness       2973 non-null   float64
 10  tempo             2973 non-null   float64
 11  valence           2973 non-null   float64
 12  popularity        2973 non-null   float64
 13  key               2973 non-null   int64  
dtypes: float64(11), int64(2), object(1)
memory usage: 325.3+ KB
None
In [6]:
genre_data.head()
Out[6]:
mode genres acousticness danceability duration_ms energy instrumentalness liveness loudness speechiness tempo valence popularity key
0 1 21st century classical 0.979333 0.162883 1.602977e+05 0.071317 0.606834 0.361600 -31.514333 0.040567 75.336500 0.103783 27.833333 6
1 1 432hz 0.494780 0.299333 1.048887e+06 0.450678 0.477762 0.131000 -16.854000 0.076817 120.285667 0.221750 52.500000 5
2 1 8-bit 0.762000 0.712000 1.151770e+05 0.818000 0.876000 0.126000 -9.180000 0.047000 133.444000 0.975000 48.000000 7
3 1 [] 0.651417 0.529093 2.328809e+05 0.419146 0.205309 0.218696 -12.288965 0.107872 112.857352 0.513604 20.859882 7
4 1 a cappella 0.676557 0.538961 1.906285e+05 0.316434 0.003003 0.172254 -12.479387 0.082851 112.110362 0.448249 45.820071 7
In [7]:
print(year_data.info())
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100 entries, 0 to 99
Data columns (total 14 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   mode              100 non-null    int64  
 1   year              100 non-null    int64  
 2   acousticness      100 non-null    float64
 3   danceability      100 non-null    float64
 4   duration_ms       100 non-null    float64
 5   energy            100 non-null    float64
 6   instrumentalness  100 non-null    float64
 7   liveness          100 non-null    float64
 8   loudness          100 non-null    float64
 9   speechiness       100 non-null    float64
 10  tempo             100 non-null    float64
 11  valence           100 non-null    float64
 12  popularity        100 non-null    float64
 13  key               100 non-null    int64  
dtypes: float64(11), int64(3)
memory usage: 11.1 KB
None
In [8]:
year_data.head()
Out[8]:
mode year acousticness danceability duration_ms energy instrumentalness liveness loudness speechiness tempo valence popularity key
0 1 1921 0.886896 0.418597 260537.166667 0.231815 0.344878 0.205710 -17.048667 0.073662 101.531493 0.379327 0.653333 2
1 1 1922 0.938592 0.482042 165469.746479 0.237815 0.434195 0.240720 -19.275282 0.116655 100.884521 0.535549 0.140845 10
2 1 1923 0.957247 0.577341 177942.362162 0.262406 0.371733 0.227462 -14.129211 0.093949 114.010730 0.625492 5.389189 0
3 1 1924 0.940200 0.549894 191046.707627 0.344347 0.581701 0.235219 -14.231343 0.092089 120.689572 0.663725 0.661017 10
4 1 1925 0.962607 0.573863 184986.924460 0.278594 0.418297 0.237668 -14.146414 0.111918 115.521921 0.621929 2.604317 5

Data Understanding by Visualization¶

Using yellowbrick to find feature correlation with the "popularity" target¶

In [9]:
from yellowbrick.target import FeatureCorrelation

feature_names = ['acousticness', 'danceability', 'energy', 'instrumentalness',
       'liveness', 'loudness', 'speechiness', 'tempo', 'valence','duration_ms','explicit','key','mode','year']

X, y = data[feature_names], data['popularity']

# Create a list of the feature names
features = np.array(feature_names)

# Instantiate the visualizer
visualizer = FeatureCorrelation(labels=features)

plt.rcParams['figure.figsize']=(20,20)
visualizer.fit(X, y)     # Fit the data to the visualizer
visualizer.show()
Out[9]:
<Axes: title={'center': 'Features correlation with dependent variable'}, xlabel='Pearson Correlation'>
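The same Pearson correlations can be computed without yellowbrick using pandas alone. A quick sketch with a toy frame (the values below are illustrative, not from the dataset):

```python
import pandas as pd

# Toy stand-in for the Spotify dataframe; values are made up.
df = pd.DataFrame({
    'energy':     [0.2, 0.5, 0.7, 0.9],
    'loudness':   [-20.0, -12.0, -8.0, -5.0],
    'popularity': [10, 35, 60, 85],
})

# Pearson correlation of each feature with the target column,
# the same quantity that FeatureCorrelation visualizes.
corr = df[['energy', 'loudness']].corrwith(df['popularity'])
print(corr.sort_values(ascending=False))
```

On the real data, `data[feature_names].corrwith(data['popularity'])` would give the numbers behind the plot above.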

Music over Time in Form of Graphs¶

In [10]:
def get_decade(year):
    period_start = int(year/10) * 10
    decade = '{}s'.format(period_start)
    return decade

data['decade'] = data['year'].apply(get_decade)

sns.set(rc={'figure.figsize':(11 ,6)})
sns.countplot(data['decade'])
Out[10]:
<Axes: xlabel='count', ylabel='decade'>
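The `get_decade` helper can also be expressed as a vectorized pandas operation, which avoids a per-row `apply` over a ~170k-row frame. A small sketch on a toy year column:

```python
import pandas as pd

# Toy year column; the notebook applies this to data['year'].
years = pd.Series([1921, 1959, 1987, 2013])

# Integer-divide down to the decade start, then append 's' --
# the same result as get_decade, but vectorized.
decades = (years // 10 * 10).astype(str) + 's'
print(decades.tolist())  # ['1920s', '1950s', '1980s', '2010s']
```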

Using the data grouped by year, we can understand how the overall sound of music has changed from 1921 to 2020.¶

In [11]:
sound_features = ['acousticness', 'danceability', 'energy', 'instrumentalness', 'liveness', 'valence']
fig = px.line(year_data, x='year', y=sound_features)
fig.show()

Visualization to compare different genres and understand their unique differences in sound.¶

In [12]:
top10_genres = genre_data.nlargest(10, 'popularity')

fig = px.bar(top10_genres, x='genres', y=['valence', 'energy', 'danceability', 'acousticness'], barmode='group')
fig.show()

A simple K-means clustering algorithm is used to divide the genres in this dataset into ten clusters¶

In [13]:
cluster_pipeline = Pipeline([('scaler', StandardScaler()), ('kmeans', KMeans(n_clusters=10))])
X = genre_data.select_dtypes(np.number)
cluster_pipeline.fit(X)
genre_data['cluster'] = cluster_pipeline.predict(X)
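The choice of ten clusters is fixed above. If you want to sanity-check a value of k, the usual elbow heuristic compares K-means inertia across candidate k values; below is a minimal sketch on synthetic data (the `n_clusters=10` above is the notebook's choice, not derived from this):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the scaled genre features.
rng = np.random.default_rng(42)
X_demo = StandardScaler().fit_transform(rng.normal(size=(300, 5)))

# Inertia (within-cluster sum of squares) for a range of k values;
# the "elbow" where it stops dropping sharply suggests a k.
inertias = {k: KMeans(n_clusters=k, n_init=10, random_state=0).fit(X_demo).inertia_
            for k in range(2, 8)}
for k, inertia in inertias.items():
    print(k, round(inertia, 1))
```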

Visualizing the Clusters with t-SNE¶

In [14]:
tsne_pipeline = Pipeline([('scaler', StandardScaler()), ('tsne', TSNE(n_components=2, verbose=1))])
genre_embedding = tsne_pipeline.fit_transform(X)
projection = pd.DataFrame(columns=['x', 'y'], data=genre_embedding)
projection['genres'] = genre_data['genres']
projection['cluster'] = genre_data['cluster']

fig = px.scatter(
    projection, x='x', y='y', color='cluster', hover_data=['x', 'y', 'genres'])
fig.show()
[t-SNE] Computing 91 nearest neighbors...
[t-SNE] Indexed 2973 samples in 0.500s...
[t-SNE] Computed neighbors for 2973 samples in 0.362s...
[t-SNE] Computed conditional probabilities for sample 1000 / 2973
[t-SNE] Computed conditional probabilities for sample 2000 / 2973
[t-SNE] Computed conditional probabilities for sample 2973 / 2973
[t-SNE] Mean sigma: 0.777516
[t-SNE] KL divergence after 250 iterations with early exaggeration: 76.106277
[t-SNE] KL divergence after 1000 iterations: 1.392739

Clustering Songs with K-Means¶

In [15]:
song_cluster_pipeline = Pipeline([('scaler', StandardScaler()), 
                                  ('kmeans', KMeans(n_clusters=20, 
                                   verbose=False))
                                 ], verbose=False)

X = data.select_dtypes(np.number)
number_cols = list(X.columns)
song_cluster_pipeline.fit(X)
song_cluster_labels = song_cluster_pipeline.predict(X)
data['cluster_label'] = song_cluster_labels

Visualizing the Clusters with PCA¶

In [16]:
pca_pipeline = Pipeline([('scaler', StandardScaler()), ('PCA', PCA(n_components=2))])
song_embedding = pca_pipeline.fit_transform(X)
projection = pd.DataFrame(columns=['x', 'y'], data=song_embedding)
projection['title'] = data['name']
projection['cluster'] = data['cluster_label']

fig = px.scatter(
    projection, x='x', y='y', color='cluster', hover_data=['x', 'y', 'title'])
fig.show()

Applying Random Forest as a Classifier on the Cluster Labels¶

In [26]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Classifying songs with Random Forest, using the genre K-means cluster labels as targets
X_songs = data.select_dtypes(np.number)
number_cols = list(X_songs.columns)

# Use the 'cluster' column from 'genre_data' as the target variable for RandomForestClassifier
y_rf = genre_data['cluster']

# Truncate so X_songs and y_rf have the same number of samples
# (note: this pairs song rows with genre-level cluster labels positionally)
X_songs, y_rf = X_songs[:len(y_rf)], y_rf[:len(X_songs)]

# Initialize and train the RandomForestClassifier
random_forest = RandomForestClassifier(n_estimators=100, random_state=42)
random_forest.fit(X_songs, y_rf)

# Predict cluster labels with RandomForestClassifier
song_cluster_labels_rf = random_forest.predict(X_songs)

# Ensure index alignment between 'data' DataFrame and 'song_cluster_labels_rf' array
data = data.iloc[:len(song_cluster_labels_rf)]

# Assign the predicted labels to the 'cluster_label_rf' column in the 'data' DataFrame
data['cluster_label_rf'] = song_cluster_labels_rf

# Split data for accuracy comparison
X_train, X_test, y_train, y_test = train_test_split(X_songs, y_rf, test_size=0.2, random_state=42)

# Create a new RandomForestClassifier for the final evaluation
random_forest_classifier = RandomForestClassifier(n_estimators=100, random_state=42)

# Train the RandomForestClassifier with the obtained labels
random_forest_classifier.fit(X_train, y_train)

# Make predictions on the test set
y_pred_rf = random_forest_classifier.predict(X_test)

# Calculate accuracy for RandomForestClassifier
accuracy_rf = accuracy_score(y_test, y_pred_rf)
print(f'Random Forest Accuracy: {accuracy_rf * 100:.2f}%')
Random Forest Accuracy: 96.47%

Setting up the Spotify Client¶

In [27]:
import spotipy
from spotipy.oauth2 import SpotifyClientCredentials
from collections import defaultdict

# Set your Spotify API credentials (replace the placeholders with your own keys)
os.environ["SPOTIFY_CLIENT_ID"] = "<your-client-id>"
os.environ["SPOTIFY_CLIENT_SECRET"] = "<your-client-secret>"

# Create a Spotify client
sp = spotipy.Spotify(auth_manager=SpotifyClientCredentials(client_id=os.environ["SPOTIFY_CLIENT_ID"],
                                                           client_secret=os.environ["SPOTIFY_CLIENT_SECRET"]))

def find_song(name, year):
    # Initialize a defaultdict to store song data
    song_data = defaultdict()

    # Search for the song using the Spotify API
    results = sp.search(q='track: {} year: {}'.format(name, year), limit=1)

    # Check if the search results contain any items
    if results['tracks']['items'] == []:
        return None

    # Extract relevant information from the search results
    results = results['tracks']['items'][0]
    track_id = results['id']
    audio_features = sp.audio_features(track_id)[0]

    # Populate the song_data dictionary with song details
    song_data['name'] = [name]
    song_data['year'] = [year]
    song_data['explicit'] = [int(results['explicit'])]
    song_data['duration_ms'] = [results['duration_ms']]
    song_data['popularity'] = [results['popularity']]

    # Add audio features to the song_data dictionary
    for key, value in audio_features.items():
        song_data[key] = value

    # Return the song data as a DataFrame
    return pd.DataFrame(song_data)
In [28]:
from collections import defaultdict
from sklearn.metrics import euclidean_distances
from scipy.spatial.distance import cdist
import difflib

# List of numerical columns representing song features
number_cols = ['valence', 'year', 'acousticness', 'danceability', 'duration_ms', 'energy', 'explicit',
               'instrumentalness', 'key', 'liveness', 'loudness', 'mode', 'popularity', 'speechiness', 'tempo']

def get_song_data(song, spotify_data):
    try:
        # Retrieve song data from the Spotify dataset based on name and year
        song_data = spotify_data[(spotify_data['name'] == song['name']) 
                                & (spotify_data['year'] == song['year'])].iloc[0]
        return song_data
    
    except IndexError:
        # If the song is not found, attempt to find it using the find_song function
        return find_song(song['name'], song['year'])

def get_mean_vector(song_list, spotify_data):
    # List to store vectors representing features of each song in the input list
    song_vectors = []

    for song in song_list:
        song_data = get_song_data(song, spotify_data)

        if song_data is None:
            print('Warning: {} does not exist in Spotify or in the database'.format(song['name']))
            continue

        # Ensure all vectors have the same length by filling missing values and flattening
        song_vector = song_data[number_cols].fillna(0).values.ravel()

        # Append the flattened vector to the list
        song_vectors.append(song_vector)

    if not song_vectors:
        print("Error: No valid song vectors found.")
        return None

    # Convert the list of vectors into a NumPy array and calculate the mean vector
    song_matrix = np.array(song_vectors)
    return np.mean(song_matrix, axis=0)

def flatten_dict_list(dict_list):
    # Flatten a list of dictionaries into a dictionary of lists
    flattened_dict = defaultdict()
    for key in dict_list[0].keys():
        flattened_dict[key] = []
    
    for dictionary in dict_list:
        for key, value in dictionary.items():
            flattened_dict[key].append(value)
            
    return flattened_dict

def recommend_songs(song_list, spotify_data, n_songs=10):
    # List of metadata columns to be included in the recommendation output
    metadata_cols = ['name', 'year', 'artists']
    # Flatten the input list of songs into a dictionary
    song_dict = flatten_dict_list(song_list)
    
    # Calculate the mean vector representing the collective features of input songs
    song_center = get_mean_vector(song_list, spotify_data)
    # Bail out early if none of the input songs could be resolved
    if song_center is None:
        return []

    # Assuming 'song_cluster_pipeline' is defined elsewhere, fetch the scaler
    scaler = song_cluster_pipeline.steps[0][1]
    # Transform the entire Spotify dataset using the scaler
    scaled_data = scaler.transform(spotify_data[number_cols])
    # Transform the mean vector of input songs using the scaler
    scaled_song_center = scaler.transform(song_center.reshape(1, -1))
    # Calculate cosine distances between the mean vector and all songs in the dataset
    distances = cdist(scaled_song_center, scaled_data, 'cosine')
    # Find the indices of songs with the lowest cosine distances
    index = list(np.argsort(distances)[:, :n_songs][0])
    
    # Retrieve recommended songs from the dataset based on the indices
    rec_songs = spotify_data.iloc[index]
    # Filter out songs that are already present in the input list
    rec_songs = rec_songs[~rec_songs['name'].isin(song_dict['name'])]
    # Convert the recommendation DataFrame to a list of dictionaries
    return rec_songs[metadata_cols].to_dict(orient='records')
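For clarity, here is how the `flatten_dict_list` helper behaves, shown with a slightly simplified but behaviorally equivalent version (using `defaultdict(list)` instead of pre-seeding keys):

```python
from collections import defaultdict

def flatten_dict_list(dict_list):
    # Turn a list of dicts into a dict of lists, keyed by the shared keys.
    flattened_dict = defaultdict(list)
    for dictionary in dict_list:
        for key, value in dictionary.items():
            flattened_dict[key].append(value)
    return flattened_dict

songs = [{'name': 'Come As You Are', 'year': 1991},
         {'name': 'Lithium', 'year': 1992}]
print(dict(flatten_dict_list(songs)))
# {'name': ['Come As You Are', 'Lithium'], 'year': [1991, 1992]}
```

This is what lets `recommend_songs` filter out input songs by name with a single `isin(song_dict['name'])` check.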

Using the recommend_songs function to recommend 10 songs¶

In [29]:
recommend_songs([{'name': 'Come As You Are', 'year':1991},
                {'name': 'Smells Like Teen Spirit', 'year': 1991},
                {'name': 'Lithium', 'year': 1992},
                {'name': 'All Apologies', 'year': 1993},
                {'name': 'Stay Away', 'year': 1993}],  data)
Out[29]:
[{'name': 'Trash Bags of That Sour', 'year': 1927, 'artists': "['Numba 9']"},
 {'name': 'Santa Claus Is Coming To Town',
  'year': 1934,
  'artists': "['The Moors']"},
 {'name': "Don't Run", 'year': 1921, 'artists': "['THE GUY']"},
 {'name': 'When We Die', 'year': 1921, 'artists': "['THE GUY']"},
 {'name': 'Power Is Power', 'year': 1921, 'artists': "['Zay Gatsby']"},
 {'name': "I Don't Mind",
  'year': 1927,
  'artists': "['Paul Bridgwater', 'Geoff Horgan']"},
 {'name': 'The Way We Are',
  'year': 1934,
  'artists': "['The Psychedelic Scorzonera']"},
 {'name': 'Last Fair Deal Gone Down',
  'year': 1936,
  'artists': '["Keb\' Mo\'"]'},
 {'name': 'Ole Bitch', 'year': 1927, 'artists': "['Numba 9']"},
 {'name': 'No Angles - Feel It Club Mix',
  'year': 1934,
  'artists': "['Mark Gamble', 'Mad Dog']"}]